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Abstract. We study the decomposition of a nonnegative tensor into a minimal sum of outer 
product of nonnegative vectors and the associated parsimonious naiVe Bayes probabilistic model. 
We show that the corresponding approximation problem, which is central to nonnegative parafac, 
will always have optimal solutions. The result holds for any choice of norms and, under a mild 
. . , assumption, even Bregman divergences. 
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^ ' 1. Dedication 

■ This article is dedicated to the memory of our late colleague Richard Allan Harshman. It is 
[ loosely organized around two of Harshman's best known works — parafac [20] and LSI [H], and 

answers two questions that he posed. We target this article at a technometrics readership. 

In Section we discussed a few aspects of nonnegative tensor factorization and Hofmann's 
I PLSi, a variant of the LSI model co-proposed by Harshman [H]. In Section [5l we answered a 

question of Harshman on why the apparently unrelated construction of Bini, Capovani, Lotti, and 
Romani in [1] should be regarded as the first example of what he called 'parafac degeneracy' [28]. 
Finally in Section El we showed that such parafac degeneracy will not happen for nonnegative 
approximations of nonnegative tensors, answering another question of his. 

(N \ 

^ [ 2. Introduction 

O 

■ The decomposition of a tensor into a minimal sum of outer products of vectors was first studied by 
Hitchcock |22| [23] in 1927. The topic has a long and illustrious history in algebraic computational 

^ _ complexity theory (cf. [7] and the nearly 600 references in its bibliography) dating back to Strassen's 

^ . celebrated result j37j . It has also recently found renewed interests, coming most notably from 

Q> I algebraic statistics and quantum computing. 

' However the study of the corresponding approximation problem, i.e. the approximation of a 

tensor by a sum of outer products of vectors, probably first surfaced as data analytic models in 
psychometrics in the work of Harshman [20], who called his model parafac (for Parallel Factor 
Analysis), and the work of Carrol and Chang [8], who called their model candecomp (for Canonical 
5^ I Decomposition). 

The candecomp/parafac model, sometimes abbreviated as CP model, essentially asks for a 
solution to the following problem: given a tensor A G ^dix---xdk ^ gj^^j o^j^ optimal rank-r approxi- 
mation to A, 

(1) Xr e argminj.ank(X)<rP - ^11, 

or, more precisely, find scalars Ap and unit vector^ Up, Vp, . . . , Zp, p = 1, . . . , r, that minimizes 

(2) A - ^'^^ Ap Up O Vp 



X 
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-'^Whenever possible, we will use Up, Vp, . . . , Zp instead of the more cumbersome , Up^' , . . . , Up*' to denote the 
vector factors in an outer product. It is to be understood that there are k vectors in "up, Vp, . . . , Zp," where fe > 3. 
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The norm || • || here is arbitrary and we wih discuss several natural choices in the next section. 
When k = 2, A becomes a matrix and a solution to the problem when || • || is unitarily invariant is 
given by the celebrated Eckart- Young theorem: X,. may be taken to be 



CTr, 



where o"i > • • • > o"r are the first r singular values of A and Up, Vp the corresponding left and right 
singular vectors. 

However when fc > 3 the problem becomes more subtle. In fact, a global minimizer of ([2]) may not 
even exist as soon as A; > 3; in which case the problem in ([T|) is ill-posed because the set of minimizers 
is empty. We refer the reader to Section [5] for examples and discussions. Nevertheless we will show 
that for nonnegative tensors the problem of finding a best nonnegative rank-r approximation always 
has a solution, i.e. ([2|) will always have a global minimum when A and Up, Vp, . . . , Zp are required to 
be nonnegative. Such nonnegativity arises naturally in applications. For example, in the context 
of chemometrics, sample concentration and spectral intensity often cannot assume negative values 
[3 [U [9l [271 E2I [M]- Nonnegativity can also be motivated by the data analytic tenet [30] that the 
way 'basis functions' combine to build 'target objects' is an exclusively additive process and should 
not involve any cancellations between the basis functions. For k = 2, this is the motivation behind 
nonnegative matrix factorization (nmf) [301I34|. essentially a decomposition of a nonnegative matrix 
A € M"*^" into a sum of outer-products of nonnegative vectors, 

A = WH^ = y^ _^ Wp (g) hp, 

or, in the noisy situation, the approximation of a nonnegative matrix by such a sum: 



minvy>o,//>oP - WH^\\ = minwp>o,hp>o 



r 



The generalization of nmf to tensors of higher order yields a model known as nonnegative parafac 
[9| 1271 132]. which has also been studied more recently under the name nonnegative tensor factor- 
ization (ntf) [35]. As we have just mentioned, a general tensor can fail to have a best low-rank 
approximation. So the first question that one should ask in a multilinear generalization of a bi- 
linear model is whether the generalized problem would still have a solution — and this was the 
question that Harshman posed. More generally, we will show that nonnegative parafac always 
has a solution for any continuous measure of proximity satisfying some mild conditions, e.g. norms 
or Bregman divergences. These include the sum-of-squares loss and Kullback-Leibler divergence 
commonly used in NMF and NTF. 

The following will be proved in Sections [6] and [71 Let ^ ^ ^ j^dix - xdfc closed convex 
subsets. Let d : O x i^o ~^ be a norm or a Bregman divergence. For any nonnegative tensor 
^ € and any given r € N, a best nonnegative rank-r approximation always exist in the sense that 
the following infimum 

mi{d{A,X) I X G J]o,rank+(X) < r} 

is attained by some nonnegative tensor X,. G Oqi rank_|_(Xr) < r. In particular, the nonnegative 
tensor approximation problem 

Xr G argmin,ank+(X)<rP - -^11 

is well-posed. Here rank_|_(X) denotes the nonnegative rank of X and will be formally introduced 
in Section m 

3. Tensors as hypermatrices 

Let Vi, . . . , Vfc be real vector spaces of dimensions di, . . . ,dk respectively. An element of the 
tensor product Vi (8) • • • ® Vfc is called an order-A; tensor. Up to a choice of bases on Vi, . . . , T4, such 
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a tensor may be represented by a di x • • • x array of real number^, 

(3) ^ = Ki---ijfc±=iel^'^ 

Gelfand, Kapranov, and Zelevinsky called such coordinate representations of abstract tensors hy- 
permatrices [19]. It is worth pointing out that an array is just a data structure but like matrices, 
hypermatrices are more than mere arrays of numerical values. They are equipped with algebraic 
operations arising from the algebraic structure of Vi (8) ■ ■ ■ (8) V/; : 

• Addition and Scalar Multiplication: For [aji --jfel, [&ji-- jfcl ^ ]^dix-xdk g^j^^j g 

(4) AK...,J + /i[^...,J = [Aa,,...,, + /x6,,...,J G R'^ix-xd.. 

• Outer Product Decomposition: Every A = [0^^...^^]] € M'^i^ ^x'^fe may be decomposed as 

(5) A = y^,^_^ Ap Up (g) Vp (g) • • • (g) Zp, «ji---jfe = ^p^j^ \'^pji'^pj2 ■ ■ ■ ^pjk-: 

with Ap G M, Up = [upi, . . . ,UpdJ^ G M"'!, . . . , Zp = [zpi, . . . j^pdj"^ G M^*, p = 1, . . . ,r. 

The symbol denotes the Segre outer product: For vectors x = [xi, . . . ,xi\^ G M', y = 
[yi, . . . , Vm]^ G W^, z = [zi, . . . , Zn]^ G M*^, the quantity x y Cg) z, is simply the 3-hypermatrix 
l,XiyjZk}\'^jJ!^i G k'>^™-x"^ with obvious generalization to an arbitrary number of vectors. 

It follows from ([5]) that j^'^ix - x'^fc is a vector space of dimension di ■ ■ ■ dk- The existence of a de- 
composition ([5]) distinguishes M'^^x - xdfc fj-om being merely a vector space by endowing it with a ten- 
sor product structure. While as real vector spaces, M'xmxn ^hypermatrices), ]R'"^xn^ ]^/nxm^ j^mnx/ 
(matrices), and M'™"- (vectors) are all isomorphic, the tensor product structure distinguishes them. 
Note that a different choice of bases onVi,...,Vk would lead to a different hypermatrix representa- 
tion of elements in (g ■ • ■ (g) 14. So strictly speaking, a tensor and a hypermatrix are different in the 
same way a linear operator and a matrix are different. Furthermore, just as a bilinear functional, 
a linear operator, and a dyad may all be represented by the same matrix, different types of ten- 
sors may be represented by the same hypermatrix if one disregards covariance and contravariance. 
Nonetheless the term 'tensor' has been widely used to mean a hypermatrix in the data analysis 
communities (including bioinformatics, computer vision, machine learning, neuroinformatics, pat- 
tern recognition, signal processing, technometrics), and we will refrain from being perverse and 
henceforth adopt this naming convention. For the more pedantic readers, it is understood that 
what we call a tensor in this article really means a hypermatrix. 

A non-zero tensor that can be expressed as an outer product of vectors is called a rank-1 tensor. 
More generally, the rank of a tensor A = lcLjj^...ji,J'^^'"''j^^-^ G M'^ix-'-xcifc^ denoted rank(A), is defined 
as the minimum r for which A may be expressed as a sum of r rank-1 tensors \22\ I23j. 



(6) rank(74) := min|r A = ^ ApU 



\p u.p 

The definition of rank in ^ agrees with the definition of matrix rank when applied to an order-2 
tensor. 

The Frobenius norm or F-norm of a tensor A = laj,...jjf^i;;;2^^ G M'^ix-xd^ defined by 

(T) pii-=[E2:.ii"'^---"f • 

The F-norm is by far the most popular choice of norms for tensors in data analytic applications. 
However when A is nonnegative valued, then there is a more natural norm that allows us to interpret 
the normalized values of A as probability distribution values, as we will see in the next section. 
With this in mind, we define the E-norm and G-norm by 

II A II •sr^dx,...,du I I 

(8) ii^ii^ = E,,..,,=J«..-.J 



^The subscripts and superscripts will be dropped when the range of j'l, . . . , jfe is obvious or unimportant. We use 
double brackets to delimit hypermatrices. 
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and 

\\A\\g = max{|aj,...jj | ji = 1, . . . , di; . . . ; jfc = 1, . . . ,4}. 

Observe that the E-, F-, and G-norms of a tensor A are simply the l^-, t^-, and /°°-norms of A 
regarded as a vector of dimension d\ - ■ ■ dk- Furthermore they are multiphcative on rank-1 tensors 
in the following sense: 

(9) ||u (g) V (g) • • • (g) zIIe = ||u||i||v||i • • • ||z||i, 

||u(g)V(g)---0z||i? = ||u||2||v||2---||z||2, 
||U (g) V ig) ■ ■ ■ (g) z||g = ||u||oo||v||oo • ■ ■ ||z||oo- 

The F-norm has the advantage of being induced by an inner product on i^'^ix-'-x'^fc^ namely, 

(10) (AS) = 2^'"'''f_ a,,..,,6,,...,,. 

As usual, it is straightforward to deduce a Cauchy-Schwarz inequality 

\{A,B)\ < \\A\\f\\B\\f, 

and a Holder inequality 

\{A,B)\ < \\A\\e\\B\\g. 

Many other norms may be defined on a space of tensors. For any 1 < p < oo, one may define the 
^P-equivalent of ([Tj), of which E-, F-, and G-norms are special cases. Another common class of 
tensor norms generalizes operator norms of matrices: For example if ^ = |ajjfc] € m'^"^^" and 

^(x,y,z) := > . _ aijkXiVjZj^ 
denotes the associated trilinear functional, then 

WAW - suD '-^(^'y'^)' 

||^||p,(?,r •— sup - - - - - — 
x,y,Z7^0 ll-^llp l|y llg ll^ll''' 

defines a norm for any 1 < p,q,r < oo. Nevertheless all these norms are equivalent (and thus induce 
the same topology) since the tensor product spaces here are finite-dimensional. In particular, the 
results in this paper apply to any choice of norms since they pertain to the convergence of sequences 
of tensors. 

The discussion in this section remains unchanged if M is replaced by C throughout (apart from a 
corresponding replacement of the Euclidean inner product in (jlOp by the Hermitian inner product) 
though a minor caveat is that the tensor rank as defined in ^ depends on the choice of base fields 
(see [13] for a discussion). 

4. NONNEGATIVE DECOMPOSITION OF NONNEGATIVE TENSORS 

We will see that a finite collection of discrete random variables satisfying both the naive Bayes 
hypothesis and the Ockham principle of parsimony have a joint probability distribution that, when 
regarded as a nonnegative tensor on the probability simplex, decomposes in a nonnegative rank- 
revealing manner that parallels the matrix singular value decomposition. This generalizes Hof- 
mann's probabilistic variant [2l] of latent semantic indexing (lsi), a well-known technique in nat- 
ural language processing and information retrieval that Harshman played a role in developing |14j . 
Nonnegative tensor decompositions were first studied in the context of parafac with nonnegativity 
constraints by the technometrics communities O El [9l [271 [32] . The interpretation as a naive Bayes 
decomposition of probability distributions into conditional distributions was due to Garcia, Still- 
man, and Sturmfels [17] and Sashua and Hazan [35] . It is perhaps worth taking this opportunity to 
point out a minor detail that had somehow been neglected in |17[l35j: the naive Bayes hypothesis is 
not sufficient to guarantee a nonnegative rank-revealing decomposition, one also needs the Ockham 
principle of parsimony, i.e. the hidden variable in question has to be minimally supported. 
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A tensor A = lajl■■■j^,j'^l'"''j^=l G 'j^'^^-^-'-^'^k nonnegative, denoted ^ > 0, if all aj^...j^ > 0. 
We will write — {A G ^.d^x-xd^ | ^ > 0}. For A > 0, a nonnegative outer-product 

decomposition is one of the form 

_^ (5p Up (g) Vp (g) • • • (g) Zp 

where 6p >0 and Up, Vp, . . . , Zp > for p = 1, . . . ,r. It is clear that such a decomposition exists for 
any A > 0. The minimal r for which such a decomposition is possible will be called the nonnegative 
rank. For A >0, this is denoted and defined via 

rank+(^) := min|r A = ^ (5p Up ® Vp ® • • • Cg) Zp, 6p, Up, Vp, . . . , Zp > for all p|. 

Let A'^ denote the unit d-simplex, i.e. the convex hull of the standard basis vectors in 1^*^+^. 
Explicitly, 

A-^ := {^'^1 SpBp e M'^+i I ^^^1 6p = 1, 5i, . . .,5^+1 > o} = {x e Mf ^ | ||x||i = 1}. 

For nonnegative valued tensors, the £^-norm has the advantage that dH]) reduces to a simple sum of 
all entries. This simple observation leads to the following proposition stating that the decomposition 
in may be realized over unit simplices if we normalize A by its iiJ-norm. 



Proposition 4.1. Let A S jg^i >< -x'^fc ^ nonnegative tensor with rank_|_(^) = r. Then there exist 
S=[5i,...,5r]'^ e Rl, Up G Vp G IR*^"S . . . , Zp G R'^^^"'^ , p=l,...,r, where 

WHi = WMe 

and 

||Up||l = ||Vp||i = • • • = ||Zp||i = 1, 

such that 

(12) ^ ^ X/ Up Vp (g) • • • (g) Zp. 

Proof. If A = 0, this is obvious. So we will suppose that A ^ 0. By the minimality of r = rank+(A), 
we know that Up, Vp, . . . , Zp in (jl2|) are all nonzero and we may assume that 

||Up||l = ||Vp||l = • • • = ||Zp||i = 1 

since otherwise we may normalize 

Up = Up/l|Upl|l,Vp = Vp/||Vp||i,. . . ,Zp = Zp/||Zp||i, 

and set 

5p = 5p||up||i||vp||i • ■ ■ ||zp||i, 
and still have an equation of the form in (|12p . It remains to show that 

ll^lli = WMe- 

Note that since all quantities involved are nonnegative, 

Mll-B = (5„ Up (g Vp (g) • • • (g) Zp = 5p||u„ ig) Vp (g) • • • (g Zpll^;. 

II Z — /p=l f r r I- ^ ^ — ^p=l 

By ([9]), the RHS can be further simplified to 

^^^^(5p||up|Ii||vp||i--- ||zp|[i = ^^^^5p = 
as required. □ 
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Note that the conditions on the vectors imply that they he in unit simphces of various dimensions: 

(13) ui,...,u^ e A'^i-\ vi,...,v^ G A'^^-i^ ^^^^ zi,...,z^ G A'^*"^ 

For k = 2, the above decomposition is best viewed as a parallel to the singular value decompo- 
sition of a matrix A £ M*"^*^, which is in particular an expression of the form 

(14) A = y^^__^ CTp Up (g) Vp, 
where r = rank(74), 



'^2 



A\\p, and ||up||2 = ||vp||2 = 1 



for all p = 1, . . . , r. Here a = [cJi, . . . ,0"^]^ G W is the vector of nonzero singular values of A. 
If A is normalized to have unit F-norm, then all quantities in ()14p may be viewed as living in 
unit spheres of various dimensions: A G §™"-i^ a g S''""^, Ui, . . . , G S"^~"^, vi, . . . , G §"~^ 
where S'^""'^ = {x G M'^ | ||x||2 = 1} is the unit sphere in M"'. For k = 2, the nonnegative matrix 
decomposition in Proposition 14. 1 1 is one where the unit spheres are replaced by unit simplices and 
the and i^'-norms replaced by the l^- and £'-norms. An obvious departure from the case of SVD 
is that the vectors in (jl3p are not orthogonal. 

Henceforth when we use the terms ntf and NMF, we will mean a decomposition of the type in 
Proposition 14. 1[ For a nonnegative tensor with unit i5^-norm, A G A^^" '^''"^, the decomposition in 
Proposition 14.11 has a probabilistic interpretation. 

Let U,V, . . . , Z he discrete random variables and q{u, v, . . . ,z) = Pr(?7 = u,V = v, . . . , Z = z) 
be their joint probability distribution. Suppose U,V, . . . , Z satisfy the naive Bayes hypothesis, i.e. 
they are conditionally independent upon a single hidden random variable 0. Let qi{u \ 6),q2{v \ 
9), . . . , qk{z I 6) denote respectively the marginal probability distributions oi U,V, . . . , Z conditional 
on the event Q = 9. Then the probability distributions must satisfy the relation 

(15) q{u, v,...,z)= Y,]^, m Qiiu I 0)q2{v \ 9) ■ ■ ■ qk{z \ 9) 

where 5{9) = Pr(0 = 9). Since the discrete random variables U,V, . . . , Z may take di,d2, ■ ■ ■ ,dk 
possible values respectively, the Bayes rule in (jl5p can be rewritten as the tensor decomposition 
in (jl2p . provided we 'store' the marginal distributions qi{u \ 9),q2{v \ 9),...,q}^{z \ 9) in the 
vectors ug, vg, . . . , z^/ respectively. The requirement that r = rank+(^) corresponds to the Ockham 
principle of parsimony: that the model (|15p be the simplest possible, i.e. the hidden variable G be 
minimally supported. 

For the case k = 2, (jlSp is Hofmann's plsi [23], a probabilistic variant of latent semantic indexing 
|14j . While it is known [18] that the multiplicative updating rule for NMF with KL divergence in 
[30j is equivalent to the use of em algorithm for maximum likelihood estimation of plsi in |24j . 
this is about the equivalence of two algorithms (em and multiplicative updating) applied to two 
approximation problems (maximum likelihood of plsi and minimum KL divergence of nmf). Since 
the EM algorithm and the nmf multiplicative updating rules are first-order methods that can at best 
converge to a stationary point, saying that these two algorithms are equivalent for their respective 
approximation problems does not imply that the respective models are equivalent. The preceding 
paragraph states that the probabilistic relational models behind plsi and ntf (and therefore nmf) 
are one and the same — a collection of random variables satisfying the naive Bayes assumption 
with respect to a parsimonious hidden variable. This is a statement independent of approximation 
or computation. 

5. Nonexistence of globally optimal solution for real and complex tensor 

approximations 

A major difficulty that one should be aware of is that the problem of finding a best rank-r 
approximation for tensors of order 3 or higher has no solution in general. There exists A G ]R<^i>< "^<^fc 
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such that 
(16) 



inf 



p=i 



XpUp 



It is also in general not possible 



is not attained by any choice of Ap, Up, Vp, . . . , Zp, p = 1, . . . 
to determine a priori if a given A S M'^i^ ^^'^fc will fail to have a best rank-r approximation. This 
problem is more widespread than one might imagine. It has been shown in [13] that examples of 
this failure happens over a wide range of dimensions, orders, ranks, and for any continuous measure 
of proximity (thus including all norms and Bregman divergence). Moreover such failures can occur 
with positive probability and in some cases with certainty, i.e. where the infimum in ()16p is never 
attained. This phenomenon also extends to symmetric tensors |12j . 

This poses some serious conceptual difficulties — if one cannot guarantee a solution a priori, 
then what is one trying to compute in instances where there are no solutions? We often get 
the answer "an approximate solution". But how could one approximate a solution that does not 
even exist in the first place? Conceptual issues aside, this also causes computational difficulties 
in practice. Forcing a solution in finite precision for a problem that does not have a solution 
is an ill-advised strategy since a well-posed problem near to an ill-posed one is, by definition, 
ill-conditioned and therefore hard to compute. This ill-conditioning manifests itself in iterative 
algorithms as summands that grew unbounded in magnitude but with the peculiar property that 
the sum remains bounded. This was first observed by Bini, Lotti, and Romani [2] in the context of 
arbitrary precision approximations (where this phenomenon is desirable). Independently Harshman 
and his collaborators Kruskal and Lundy [28j also investigated this phenomenon, which they called 
PARAFAC degeneracy, from the perspective of model fitting (where it is undesirable). In Theorem 
6.11 we will prove the cheerful fact that one does not need to worry about PARAFAC degeneracy 
when fitting a nonnegative PARAFAC model. 

The first published account of an explicitly constructed example of parafac degeneracy appeared 
in a study of the complexity of matrix multiplication by Bini, Capovani, Lotti, and Romani [1]. 
However their discussion was for a context entirely different from data analysis/model fitting and 
was presented in notations somewhat unusual. Until today, many remain unconvinced that the 
construction in [1] indeed provides an explicit example of parafac degeneracy and continue to 
credit the much later work of Paatero |33j . The truth is that such constructions are well-known 
in algebraic computational complexity; in addition to [T], one may also find them in [SJ [Tj [26], 
all predating [33]. As a small public servic^, we will translate the original construction of Bini, 
Capovani, Lotti, and Romani into notations more familiar to the technometrics communities. 

In [1], Bini, Capovani, Lotti, and Romani gave an algorithm that can approximate to arbitrary 
precision the product of two nxn matrices and requires only 0(n^'^^^^) scalar multiplications. The 
key to their construction is the following triplet of matrices which at first glance seem somewhat 
mysterious: 



U 



10 10 1 
e e 
110 10 



V 




-1 


-1 1 



w 









e-1 











1 










£-1 


1 








-1 



We will show that these matrices may be used to construct a sequence of tensors exhibiting parafac 
degeneracy. 



We will assume that U has a 4th row of zeros and so U,V,W & M^^^. As usual, Uij,Vij,Wij 
denote the (i,j)th entry of the respective matrices. Let re > 4 and xi,X2,X3,X4 G M" (or C"' 
linearly independent vectors. For e > 0, define 



will 

) be 



A, 



i=l 



also to fulfill, belatedly, an overdued promise made to Richard when he was preparing his bibliography on 
PARAFAC degeneracy. 
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Observe that 

Ae = (xi + X3) (g) (exi + X4) (g) (e~"^xi + X4) + X3 ® (-X2 - X4) <S) e~^xi 

+ xi (g) X4 (-e~-^xi - e"^X3) + (ex2 + X3) (g) (-exi + X2) (g) (e~"^xi + X2) 

+ (xi + 6X2) (g) (ex3 + X4) (g) (e""^X3 - X4). 

It is straight forward to verify that 

hm^^o As = A 

where 

A = Xi (g) Xi (g) Xi + Xi eg) X3 (g) X3 + X2 (g) X2 (g) Xi + X2 (g) X4 iX) X3 + X3 ® X2 (g) X2 + X3 (g) X4 (g) X4. 

Note that the sequence A^ exhibits parafac degeneracy: as e — > 0, each of the summands becomes 
unbounded in magnitude but A^ remains bounded (and in fact converges to A). 
Regardless of whether xi,X2,X3,X4 G R" or C", it is clear that for all e > 0, 

vank{Ae) < 5. 

Furthermore, one may show that rank(j4) = 6 over C and therefore rank(A) > 6 over M (cf. 
remarks at the end of Section [3]). In either case, A is an instance in ]R"><"-x" or C"^"^'^ where 
the approximation problem in ([T]) has no solution for r = 5 — since infi.jjnk(j(^)<5||A — X\\ = and 
rank(^) > 6 together imply that 

argminrank{x)<5p - -^11 = 0- 

Hence the construction in [I] also yields an explicit example of a best rank-r approximation problem 
(over M and C) that has no solution. 



6. Existence of globally optimal solution for nonnegative tensor approximations 



As we have mentioned in Section [21 nonnegativity constraints are often natural in the use of 
parafac. Empirical evidence from Bro's chemometrics studies revealed that parafac degeneracy 
was never observed when fitting nonnegative-valued data with a nonnegative parafac model. 
This then led Harshman to conjecture that this is always the case. The text of his e-mail had been 
reproduced in [31]. 

The conjectured result involves demonstrating the existence of global minimum over a non- 
compact feasible region and is thus not immediate. Nevertheless the proof is still straightforward by 
the following observation: If a continuous real-valued function has a non-empty compact sublevel 
set, then it has to attain its infimum — a consequence of the extreme value theorem. This is 
essentially what we will show in the following proof for the nonnegative parafac loss function (in 
fact, we will show that all sublevel sets of the function are compact). We will use the E'-norm in 
our proof for simplicity, the result for other norms then follows from the equivalence of all norms on 
finite-dimensional spaces. Essentially the same proof, but in terms of the more customary F-norm, 
appeared in [31]. We will follow the notations in Section |H 



Theorem 6.1. Let A G j^c^i x -xc^fe nonnegative. Then 

deR\,Up G A'^i-\...,zp G A'^^~\p= l,...,r} 



infj A - ^^^^ 5p Up (g) Vp 
is attained. 



Proof. Recall that 

or 



function / 
(17) 



X 



{x G 
•• X M 

f(.T) :-- 



X > 0} and A"-i 
M by 



{x G 



X 1 



1}. We define the 



A 
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where we let T = ((5i, . . . , (5,.; ui, vi, . . . , zi; . . . ; u^, v^, . . . , z^) denote the argument of /. Let V be 
the following subset of M"^ x {R'^^ x • • • x M'^")'^ = MKi+<ii+- +rffc)^ 

V := ]R'+ X (A'^i-^ X • • • X A'^'^-^y. 

Note that D is closed but unbounded. Let the infimum in question be fi := inf{/(T) | T € V}. We 
will show that the sublevel set of / restricted to V, 

£a = {TGV\ f{T) < a} 

is compact for all q > and thus the infimum of / on D must be attained. The set £a = 
Pn/~^(— oo, a] is closed since / is continuous (by the continuity of norm). It remains to show that 
£a is bounded. Suppose the contrary. Then there exists a sequence {Tn)^^i C V with ||T„||i — > oo 

but /(T„) < a for all n. Clearly, — > oo implies that dip^ oo for at least one q G {1,. . . ,r}. 
Note that 

/(T) > \\A\\e - ||^^^^5pUp(g) Vp0 ••• ®zp 
Since all terms involved in the approximant are nonnegative, we have 



5p Up vp ® • • • zp||^ = Y.^Zh=i ^p=i ^p'^p^^'^p^^ 



di,...,dk 



iiv,ifc=i 

rffe 



''iivJfc=i 

= dq\\Uq ® Vg (g) • • • (g) Zglls 
= • • • ||Zg||l 

= 5, 

where the last two equalities follow from Q and ||uq||i = |jvq||i = ••• = ||zg||i = 1. Hence, as 
5q^^ — > oo, f{Tn) —fQO — contradictiug the assumption that f{Tn) < a for all 7i. □ 

The proof essentially shows that the function / is coercive — a real-valued function / is said 
to be coercive for minimization if lim||x||^+oo /(x) = +oo 0]. This is a standard condition often 
used to guarantee that a continuous function on a noncompact domain attains its global minimum 
and is equivalent to saying that / has bounded sublevel sets. A minor point to note is that had 
we instead optimized over a sum of rank-1 terms u (g v (g • • • (g) z, the proof would fail because the 
vectors u, v, . . . ,z may be scaled by non-zero positive scalars that product to 1, i.e. 

au (8)/3v(8)---(8)Cz = u(g)v(g)---(8)z, aP ■ ■ ■ C, = I. 

So for example (nx) (8) y (8) (z/n) can have diverging loading factors even while the outer-product 
remains fixed. We avoided this by requiring that u, v, . . . ,z be unit vectors and having a S that 
records the magnitude. 

The following proposition provides four useful characterizations of the statement that the function 
rank+ : Mf^ x ■■■ xc^fe jg ^pper semicontinuous. This is the nonnegative rank equivalent of a similar 
result in [13]. 



Proposition 6.2. Let r. A; € N and let the topology on induced by the E-norm. The 

following statements are equivalent; and since the last statement is true by Theorem \6.1l so are the 
others. 

(a) The set S,. := {X G j rank+(X) < r] is closed. 

(b) Every A G •^'hy- — y-dk^ rank_|_(A) > r, has a best nonnegative rank-r approximation, i.e. 

mf{\\A - X\\e I rank+(X) < r} 
is attained (by some Xr with rank+(Xr) <r). 
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(c) No A G M^^^' "^ rank_|_(^) > r, can he approximated arbitrarily closely by nonnegative 
tensors of strictly lower nonnegative rank, i.e. 

ini{\\A - X\\e I rank+(X) < r} > 0. 

rank^{Xn) < r, can converge to A ^ j^dix - xdfe 



(d) No sequence (X 
iaY\kj^{A) > r 



oo 
n)n=l 



c 



with 



A-X\\< 

Tn,diX---xa!fc 



1^11} intersects 
I rank+(X) < 



Proof, (a) =^ (b): Suppose Sr is closed. Since the set {X € | || 

Sr non-trivially (e.g. is in both sets). Their intersection T = {X G 
r, 11^ — X||<||^||}isa non-empty compact set. Now observe that 

5 := inf{P - X\\ \ X & Sr} = \ni{\\A -X\\\X ^T} 

since any X' S Sr\T must have — > ||yl|| while we know that 5 < \\A\\. By the compactness 
of T, there exists X^, S T such that \\A — = 5. So the required infimum is attained by 
X^: S T C 5. The remaining implications (b) =^ (c) =^ (d) ^ (a) are obvious. □ 

In the language of [71 [37], this says that 'nonnegative border rank' coincides with nonnegative 
rank. An immediate corollary is that the E'-norm in Theorem 16.11 and Proposition 16.21 may be 
replaced by any other norm. In fact we will see later that we may replace norms with more general 
measures of proximity. 



Corollary 6.3. Let A 

arbitrary norm. Then 



inf 



A 



e: 



6p Up < 



Vp (g) 



x-xd, 



* be nonnegative and 



[0, oo) be an 



S G 



\...,ZpeA''^-\p 



is attained. 



Proof. This simply follows from the fact that all norms on finite dimensional spaces are equivalent 
and so induce the same topolog y on M^ix-x'^fc. So Proposition 16.21 holds for any norms. In 
particular, the statement (b) in Proposition 16.21 for an arbitrary norm || • || is exactly the result 
desired here. □ 



Corollary 16.31 implies that the parafac degeneracy discussed in Section [5] does not happen for 
nonnegative approximations of nonnegative tensors. There is often a simplistic view of PARAFAC 
degeneracy as being synonymous to 'between component cancellation' and thus cannot happen for 
nonnegative tensor approximation since it is 'purely additive with no cancellation between parts' 
[30[ I35j . While it provides an approximate intuitive picture, this point of view is flawed since 
PARAFAC degeneracy is not the same as 'between component cancellation'. There is cancellation in 
nx ^ytSiz — (n-|-^)x(8)y(8)z but the sequence exhibits no parafac degeneracy. Conversely, the 
sequence of nonnegative tensors 



Ar, 




1 



1 

1/n 



1 

1/n 



1/n 
l/n2 



1,2x2x2 



may each be decomposed nonnegatively as 



(18) 

with A,B,C e 
A 



p2x2x2 



1 



given by 



1 




Ar. 



B 



A + -B + —C 



n 


1 








C 








and each may in turn be decomposed into a sum of rank-1 terms. While there is no 'between 
component cancellation' among these rank-1 summands, it is known [13] that the convergence 



lim A„ 

n— >oo 



A 
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exhibits PARAFAC degeneracy over ]r2x2x2 ^2x2x2^ where there are decompositions of An different 
from the one given in (jlSp exhibiting parafac degeneracy. That such decompositions cannot 
happen over R_j_ is precisely the statement of Proposition 16.21 which we proved by way of 
Theorem 16. li 

7. Bregman divergences 

In many appUcations, a norm may not be the most suitable measure of proximity. Other measures 
based on entropy, margin, spectral separation, volume, etc, are often used as loss functions in 
matrix and tensor approximations. Such measures may not even be a metric, an example being the 
Bregman divergence [31 [T21 [22] , a class of proximity measures that often have information theoretic 
or probabilistic interpretations. In the definition below, ri(r2) denotes the relative interior of 0, i.e. 
the interior of Q regarded as a subset of its affine hull; || • || is any arbitrary norm on ^'^i^-'-^'^k — 
again the choice of which is immaterial since all norms induce the same topology on i}. 

Definition 7.1. Let 7^ C J^'^i^^'^'^fc a closed convex set. Let : 17 — > M 6e continuously 
differentiable on ri(i7) and strictly convex and continuous on Q. The function Dtp : $7 x ri(Q) —> M 
defined by 

D^{A, B) = ip{A) - ip{B) - {Vip{B),A - B) 
is a Bregman divergence if 

(i) For any fixed A ^ Q, the suhlevel set 

Co,{A) = {Xe I D^{A,X) < a} 

is bounded for all a (zR. 

(ii) Let iXn)^=i C ri(0) and A en. If 

lim„^oo||^ - Xn\\ = 0, 

then 

lim„_oo D^{A, Xn) = 0. 

(iii) Let C n{n), A en, and C n. If 

lim„_^oo||^ - Xn\\ = 0, limsup„_3oll^n|| < 00, liuin-^oo D^{An, Xn) = 0, 

then 

lim„^oo||A - ^nW = 0. 

Note that D^(j4, B) >0 and that B) = iS A = B hy the strict convexity of ip. However 

Dip need not satisfy the triangle inequality nor must it be symmetric in its two arguments. So a 
Bregman divergence is not a metric in general. 

Bregman divergences are particularly important in nonnegative matrix and tensor decomposi- 
tions |301I35|. In fact, one of the main novelty of nmf as introduced by Lee and Seung [30] over the 
earlier studies in technometrics [Oj [27l [34] is their use of the Kullback-Leibler divergence [29] as a 
proximity measur^. The kl divergence is defined for nonnegative matrices in [30] but it is straight- 
forward to extend the definition to nonnegative tensors. For A € ]^<^i ^^■■■^'^fc qt^A B S ri(]R!^^^"'^'^''), 
this is 

Z^kl(^, B) = _ k...,, log(^^) - aj,...j^ + 6,,.., J , 

where log is taken to be 0, the limiting value. It comes from the following choice of (p, 



aj^...j^\ogaj^...j^. 

Jl V Jfe = l 



We note that Kullback and Liebler's original definition [29] was in terms of probability distributions. 
The version that we introduced here is a slight generalization. When A and B are probability 



^This brought back memories of the many intense e-mail exchanges with Harshman, of which one was about the 
novelty of nmf. His fervently argued messages will be missed. 
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distributions as in Section U then ||A||£; = |1-B||_e = 1 and our definition reduces to the original one 
in [29], 

In this case Dkl{A, B) may also be interpreted as the relative entropy of the respective distributions. 

It is natural to ask if the following analogous nonnegative tensor approximation problem for 
Bregman divergence will always have a solution: 

(19) Xr e Siigmln{D^ {A,X) \ X € ri(17), rank+(X) < r). 

Clearly, the problem cannot be expected to have a solution in general since ri(r2) is not closed. For 
example \eiA = e®e®e^ M^^^^^ where e = [1, 0]^, then 

inf{£)KL {A,X) I X G i-\{M^^'^'''^),v&n\^+{X) < 1} = 

cannot be attained by any x (g) y ® z S ri(M^^^^^) since if we set x„ = y„ = z„ = then 
as n — i- oo, 

Dkl [A x„ (g) y„ (g) z„) = ^ ^ 0. 

This is simply a consequence of the way a Bregman divergence is defined and has nothing to do 
with any peculiarities of tensor rank, unlike the example discussed in Section [5j This difficulty may 
be avoided by posing the problem for any closed (but not necessarily compact) subset of ri(r2). 

Proposition 7.2. Let Q be a closed convex su bsetofR'^'' ''"^'' andAe n. Let : SI x ri(J7) 
be a Bregman divergence. Then 

(20) inf{D^ {A, X)\X eK, rank+(X) < r} 
is attained for any closed subset K C ri(r2). 

Proof. Recall that Sr '■= {X G j^^i^^-'-x'^fe j rank4.(X) < r}. The statement is trivial if rank+(A) < 
r. So we will also assume that rank_|_(A) > r + 1. Let /_i be the infimum in (j20p and let a > /z. By 
(i) in Definition 17.11 the sublevel set Cci{A) is bounded and so its subset 

KnSrn Ca{A) = {x eKr\Sr\ d^{A, x) < a} 

must also be bounded. Note that K fi Sr is closed. Since ip is continuously differentiable on ri(Q), 
the function X i— > D^{A, X) is continuous and so K D Sr ri Ca{A) is also closed. Hence Dip{A, X) 
must attain /_f on the compact set K OSr H Ca{A). □ 

As one can see from the proof. Proposition 17.21 extends to any other measure of proximity d{A, X) 
where the function X i— > d{A, X) is continuous and coercive. Of course this is just a restatement 
of the problem, the bulk of the work involved is usually to show that the proximity function in 
question has those required properties. 

8. Aside: norm-regularized and orthogonal approximations 

We have often been asked about norm-regularized and orthogonal approximations of tensors 
that are not necessarily nonnegative. These approximation problems are useful in practice [10^ 
[2T| [33] . Nevertheless these always have optimal solutions for a much simpler reason — they are 
continuous optimization problems over compact feasible set, so the existence of a global minima is 
immediate from the extreme value theorem (note that this is not the case for nonnegative tensor 
approximation). In the following, we will let A G jg^i x - x^r^ j^q^ necessarily nonnegative. 

Recall that 0(n,r), the set of n x r matrices (r < n) with orthonormal columns, is compact 
in R"^''. If we impose orthonormality constraints on the normalized loading factors in dJj), i.e. 
[ui, . . . , Ur] G 0{di,r), . . . , [zi, . . . , Zr] G 0(dfc, r), then it follows that |Ap| < ||^|||' for p = 1, . . . , r. 
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i.e. A G [— ||j4||ir]^ C M^. Since the PARAFAC objective is continuous and we are effectively 
minimizing over the compact feasible region 

X 0(di,r) X ••• X 0{dk,r), 

this shows that orthogonal PARAFAC always has a globally optimal solution. 

Next, the regularization proposed in |33j is to add to the parafac objective terms proportional 
to the 2-norm of each loading factor, i.e. 

(21) ^ - y^'' ap ® bp ® • • • Cp % P V'' ^ (||ap||^ + ||bp||^ + • • • + ||cp||^). 

From constrained optimization theory, we know that, under some regularity conditions, minimizing 
a continuous function /(xi,...,Xfc) under constraints ||xj||2 = Tj, i = is equivalent to 

minimizing the functional /(xi, . . . , x^) + X^*Li ftllxilll appropriate . . . , G M. In a finite 
dimensional space, the sphere of radius rj is compact, and so is the feasible set defined by ||xj||2 = Tj, 
i = 1, . . . , A;, and thus / must attain its extrema. In the same vein, ()21|) is equivalent to an equality 
constrained optimization problem and so norm-regularized parafac always has a globally optimal 
solution. This approach may also be applied to regularizations other than the one discussed here. 
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